
for 1-bit CNNs as

\[
\begin{aligned}
L_B = {}& \frac{\lambda}{2}\sum_{l=1}^{L}\sum_{i=1}^{C_o^l}\sum_{n=1}^{C_i^l}
\Big\{ \big\|\hat{k}_n^{l,i} - w^l \circ k_n^{l,i}\big\|_2^2
+ \nu\big(k_n^{l,i+} - \mu_i^{l+}\big)^{T}\big(\Psi_i^{l+}\big)^{-1}\big(k_n^{l,i+} - \mu_i^{l+}\big) \\
&+ \nu\big(k_n^{l,i-} + \mu_i^{l+}\big)^{T}\big(\Psi_i^{l-}\big)^{-1}\big(k_n^{l,i-} + \mu_i^{l+}\big)
+ \nu\log\!\big(\det(\Psi^l)\big)\Big\} \\
&+ \frac{\theta}{2}\sum_{m=1}^{M}\Big\{ \|f_m - c_m\|_2^2
+ \sum_{n=1}^{N_f}\big[\sigma_{m,n}^{-2}(f_{m,n} - c_{m,n})^2 + \log(\sigma_{m,n}^2)\big]\Big\},
\end{aligned}
\tag{3.108}
\]

where $k_n^{l,i}$, $l \in \{1,\dots,L\}$, $i \in \{1,\dots,C_o^l\}$, $n \in \{1,\dots,C_i^l\}$, is the vectorization of the $i$-th kernel matrix at the $l$-th convolutional layer, $w^l$ is a vector used to modulate $k_n^{l,i}$, and $\mu_i^l$ and $\Psi_i^l$ are the mean and covariance of the $i$-th kernel vector at the $l$-th layer, respectively. We term $L_B$ the Bayesian optimization loss. Furthermore, we assume that the parameters in the same kernel are independent, so $\Psi_i^l$ becomes a diagonal matrix whose diagonal entries all equal $(\sigma_i^l)^2$, the variance of the $i$-th kernel of the $l$-th layer. In this case, the inverse of $\Psi_i^l$ is cheap to compute, and all the elements of $\mu_i^l$ are identical and equal to the scalar mean $\mu_i^l$. Note that in our implementation, all elements of $w^l$ are replaced by their average during the forward process. Accordingly, only a scalar instead of a matrix is involved in inference, which significantly accelerates the computation.
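For illustration, the following minimal PyTorch sketch evaluates the two parts of Eq. 3.108 for a single layer under the simplifications described above (a diagonal $\Psi_i^l$ with one variance per kernel, a scalar mean per kernel, and an averaged $w^l$). The tensor shapes, the sign-based split of kernel entries into the two modes, the single shared variance for both modes, and the function names are assumptions made for this sketch rather than the original BONN implementation; the $\lambda$ and $\theta$ weights are left to the caller.

```python
import torch

def bayesian_kernel_loss_layer(k, w, mu, sigma2, nu):
    # k:      (C_out, C_in, K) full-precision kernels, one flattened vector per filter
    # w:      scalar modulation (the averaged w^l used in the forward pass)
    # mu:     (C_out,) positive-mode mean, one scalar per output channel
    # sigma2: (C_out,) shared per-kernel variance, i.e. Psi_i^l = sigma2_i * I
    k_hat = torch.sign(k)                              # binarized kernels
    quant = ((k_hat - w * k) ** 2).sum()               # quantization (reconstruction) error

    mu_ = mu.view(-1, 1, 1)
    s2 = sigma2.view(-1, 1, 1)
    pos = (k > 0).float()                              # positive entries -> mode +mu
    neg = 1.0 - pos                                    # negative entries -> mode -mu
    quad = ((k - mu_) ** 2 / s2 * pos).sum() + ((k + mu_) ** 2 / s2 * neg).sum()

    # log det of a diagonal K x K covariance, counted once per kernel vector
    logdet = k.shape[1] * k.shape[2] * torch.log(sigma2).sum()
    return 0.5 * (quant + nu * (quad + logdet))        # lambda is applied by the caller

def bayesian_feature_loss(f, c, sigma2):
    # f: (M, N_f) features, c: (M, N_f) class centers gathered per sample,
    # sigma2: (M, N_f) learned per-dimension feature variances
    diff2 = (f - c) ** 2
    return 0.5 * (diff2.sum() + (diff2 / sigma2 + torch.log(sigma2)).sum())  # theta applied by the caller
```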

After training the 1-bit CNN, the Bayesian pruning loss $L_P$ is then used for the optimization of feature channels, which can be written as:

\[
L_P = \sum_{l=1}^{L}\sum_{j=1}^{J^l}\sum_{i=1}^{I_j}
\Big\{ \big\|K_{i,j}^l - \bar{K}_j^l\big\|_2^2
+ \nu\big(K_{i,j}^l - \bar{K}_j^l\big)^{T}\big(\Psi_j^l\big)^{-1}\big(K_{i,j}^l - \bar{K}_j^l\big)
+ \nu\log\!\big(\det(\Psi_j^l)\big)\Big\},
\tag{3.109}
\]

where $J^l$ is the number of Gaussian clusters (groups) of the $l$-th layer, and $K_{i,j}^l$, $i = 1, 2, \dots, I_j$, are those $K_i^l$'s that belong to the $j$-th group. In our implementation, we define $J^l = \mathrm{int}(C_o^l \times \epsilon)$, where $\epsilon$ is a predefined pruning rate. In this chapter, we use one $\epsilon$ for all layers. Note that when the $j$-th Gaussian has just one sample $K_{i,j}^l$, we set $\bar{K}_j^l = K_{i,j}^l$ and take $\Psi_j^l$ to be the identity matrix.
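To make Eq. 3.109 concrete, a minimal sketch of the per-layer pruning loss is given below. It assumes the cluster assignments (the $J^l$ groups) are already available and that each $\Psi_j^l$ is diagonal and estimated from its group members; the grouping procedure, the variable names, and the small constant added for numerical stability are assumptions of this sketch, not part of the original method.

```python
import torch

def bayesian_pruning_loss_layer(kernels, groups, nu, eps=1e-8):
    # kernels: (C_out, D) flattened kernels K_i^l of one layer
    # groups:  list of J^l index tensors, one per Gaussian cluster
    loss = kernels.new_zeros(())
    for idx in groups:
        K = kernels[idx]                                 # (I_j, D) members of group j
        K_bar = K.mean(dim=0, keepdim=True)              # group mean \bar{K}_j^l
        if K.shape[0] == 1:                              # single member: identity covariance
            var = torch.ones_like(K_bar)
        else:                                            # diagonal covariance from the group
            var = K.var(dim=0, unbiased=False, keepdim=True) + eps
        diff2 = (K - K_bar) ** 2
        loss = loss + diff2.sum()                            # ||K_ij - K_bar_j||^2 term
        loss = loss + nu * (diff2 / var).sum()               # Mahalanobis term with diagonal Psi_j
        loss = loss + nu * K.shape[0] * torch.log(var).sum() # log det(Psi_j), once per member
    return loss
```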

In BONNs, the cross-entropy loss $L_S$, the Bayesian optimization loss $L_B$, and the Bayesian pruning loss $L_P$ are aggregated together to build the total loss as:

\[
L = L_S + L_B + \zeta L_P,
\tag{3.110}
\]

where $\zeta$ is 0 during binarization training and becomes 1 during pruning. The Bayesian kernel loss constrains the distribution of the convolution kernels to a symmetric Gaussian mixture with two modes, and it simultaneously minimizes the quantization error through the $\|\hat{k}_n^{l,i} - w^l \circ k_n^{l,i}\|_2^2$ term. Meanwhile, the Bayesian feature loss modifies the distribution of the features to reduce intra-class variation for better classification. The Bayesian pruning loss drives kernels toward their cluster means and thus compresses the 1-bit CNNs further.
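A possible implementation of the two-stage schedule implied by Eq. 3.110 is sketched below; the boolean stage flag and the function name are hypothetical, the only point being that $\zeta$ switches from 0 to 1 when pruning begins.

```python
def total_loss(L_S, L_B, L_P, pruning_stage):
    # Eq. 3.110: zeta = 0 while training the 1-bit CNN, zeta = 1 in the pruning stage
    zeta = 1.0 if pruning_stage else 0.0
    return L_S + L_B + zeta * L_P
```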

3.7.5 Forward Propagation

In forward propagation, the binarized kernels and activations accelerate the convolution

computation. The reconstruction vector is essential for 1-bit CNNs as described in Eq. 3.97,